Abstract

Sequence-to-sequence translation methods based on generation with a
side-conditioned language model have recently shown promising results
in several tasks. In machine translation, models conditioned on source
side words have been used to produce target-language text, and in image
captioning, models conditioned on images have been used to generate
caption text. Past work with this approach has focused on large
vocabulary tasks, and measured quality in terms of BLEU. In this paper,
we explore the applicability of such models to the qualitatively
different grapheme-to-phoneme task. Here, the input and output side
vocabularies are small, plain n-gram models do well, and credit is only
given when the output is exactly correct. We find that the simple
side-conditioned generation approach is able to rival the
state-of-the-art, and we are able to significantly advance the
state-of-the-art with bi-directional long short-term memory (LSTM)
neural networks that use the same alignment information that is used in
conventional approaches.

Index Terms: neural networks, grapheme-to-phoneme conversion,
sequence-to-sequence neural networks

1. Introduction 

In recent work on sequence to sequence translation, it has been shown
that side-conditioned neural networks can be effectively used for both
machine translation [?] and image captioning [7, 10]. The use of a
side-conditioned language model [?] is attractive for its simplicity
and apparent performance, and these successes complement other recent
work in which neural networks have advanced the state-of-the-art, for
example in language modeling [12, 13], language understanding [?], and
parsing [?].

In these previously studied tasks, the input vocabulary size is large,
and the statistics for many words must be sparsely estimated. To
alleviate this problem, neural network based approaches use
continuous-space representations of words, in which words that occur in
similar contexts tend to be close to each other in representational
space. Therefore, data that benefits one word in a particular context
causes the model to generalize to similar words in similar contexts.
The benefits of using neural networks, in particular both simple
recurrent neural networks [16] and long short-term memory (LSTM) neural
networks [17-19], to deal with sparse statistics are very apparent.

However, to the best of our knowledge, the top performing methods for
the grapheme-to-phoneme (G2P) task have been based on the use of
Kneser-Ney n-gram models [20]. Because of the relatively small
cardinality of letters and phones, n-gram statistics, even with long
context windows, can be reliably trained. On G2P tasks, maximum entropy
models [?] also perform well. The G2P task is distinguished in another
important way: whereas the machine translation and image captioning
tasks are scored with the relatively forgiving BLEU metric, in the G2P
task, a phonetic sequence must be exactly correct in order to get
credit when scored.

In this paper, we study the open question of whether side-conditioned
generation approaches are competitive on the grapheme-to-phoneme task.
We find that the LSTM approach proposed by [?] performs well and is
very close to the state-of-the-art. While the side-conditioned LSTM
approach does not require any alignment information, the
state-of-the-art "graphone" method of [20] is based on the use of
alignments. We find that when we allow the neural network approaches to
also use alignment information, we significantly advance the
state-of-the-art.

The remainder of the paper is structured as follows. We review previous
methods in Sec. 2. We then present side-conditioned generation models
in Sec. 3 and models that leverage alignment information in Sec. 4. We
present experimental results in Sec. 5 and provide a further comparison
with past work in Sec. 6. We conclude in Sec. 7.

2. Background 

This section summarizes the state-of-the-art solution for G2P 
conversion. The G2P conversion can be viewed as translating an 
input sequence of graphemes (letters) to an output sequence of 
phonemes. Often, the grapheme and phoneme sequences have 
been aligned to form joint grapheme-phoneme units. In these 
alignments, a grapheme may correspond to a null phoneme with 
no pronunciation, a single phoneme, or a compound phoneme. 
The compound phoneme is a concatenation of two phonemes.
An example is given in Table 1.

Letters     T    A    N    G    L      E
Phonemes    T    AE   NG   G    AH:L   null

Table 1: An example of an alignment of letters to phonemes.
The letter L aligns to a compound phoneme, and the letter E to
a null phoneme that is not pronounced.

Given a grapheme sequence $L = l_1, \cdots, l_T$, a corresponding
phoneme sequence $P = p_1, \cdots, p_T$, and an alignment $A$, the
posterior probability $p(P \mid L, A)$ is approximated as:

$$p(P \mid A, L) \approx \prod_{t=1}^{T} p(p_t \mid p_{t-k}^{t-1}, l_{t-k}^{t+k}) \qquad (1)$$

where $k$ is the size of a context window, and $t$ indexes the
positions in the alignment.
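
As a rough illustration of Eq. (1), the conditioning context at each alignment position can be assembled as shown below. This is only a sketch and not the implementation used in the cited work: the helper name `context_at`, the padding symbol `<pad>`, and the choice k = 2 are assumptions made for the example. The aligned TANGLE sequences are those of Table 1.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding symbol for positions outside the word


def context_at(letters: List[str], phonemes: List[str], t: int, k: int
               ) -> Tuple[List[str], List[str]]:
    """Return (p_{t-k}^{t-1}, l_{t-k}^{t+k}) for the 0-based position t."""
    def get(seq: List[str], i: int) -> str:
        return seq[i] if 0 <= i < len(seq) else PAD

    past_phonemes = [get(phonemes, i) for i in range(t - k, t)]
    letter_window = [get(letters, i) for i in range(t - k, t + k + 1)]
    return past_phonemes, letter_window


# Aligned example from Table 1 (one phoneme unit per letter).
letters = ["T", "A", "N", "G", "L", "E"]
phonemes = ["T", "AE", "NG", "G", "AH:L", "null"]

past, window = context_at(letters, phonemes, t=2, k=2)
print(past)    # ['T', 'AE']
print(window)  # ['T', 'A', 'N', 'G', 'L']
```

An n-gram or maximum entropy model would then estimate the probability of the phoneme at position t from these two context lists.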
Following [21, 22], Eq. (1) can be estimated using an exponential (or
maximum entropy) model in the form of

$$p(p_t \mid x = (p_{t-k}^{t-1}, l_{t-k}^{t+k})) = \frac{\exp\left(\sum_i \lambda_i f_i(x, p_t)\right)}{\sum_q \exp\left(\sum_i \lambda_i f_i(x, q)\right)} \qquad (2)$$

where features $f_i(\cdot)$, weighted by $\lambda_i$, are usually 0 or 1
indicating the identities of phones and letters in specific contexts.

Joint modeling has been proposed for grapheme-to-phoneme conversion
[20, 21, 23]. In these models, one has a vocabulary of grapheme and
phoneme pairs, which are called graphones. The probability of a
graphone sequence is

$$p(C = c_1 \cdots c_T) = \prod_{t=1}^{T} p(c_t \mid c_1 \cdots c_{t-1}), \qquad (3)$$

where each $c$ is a graphone unit. The conditional probability
$p(c_t \mid c_1 \cdots c_{t-1})$ is estimated using an n-gram language
model.

To date, these models have produced the best performance on common
benchmark datasets, and are used for comparison with the architectures
in the following sections.

3. Side-conditioned Generation Models

In this section, we explore the use of side-conditioned language models
for generation. This approach is appealing for its simplicity, and
especially because no explicit alignment information is needed.

3.1. Encoder-decoder LSTM

In the context of general sequence to sequence learning, the concept of
encoder and decoder networks has recently been proposed [?]. The main
idea is mapping the entire input sequence to a vector, and then using a
recurrent neural network (RNN) to generate the output sequence
conditioned on the encoding vector. Our implementation follows the
method in [?], which we denote as encoder-decoder LSTM. Figure 1
depicts a model of this method. As in [?], we use an LSTM [?] as the
basic recurrent network unit because it has shown better performance
than simple RNNs on language understanding [26] and acoustic modeling
[?] tasks.

Figure 1: An encoder-decoder LSTM with two layers. The encoder LSTM, to
the left of the dotted line, reads a time-reversed sequence
"<s> T A C" and produces the last hidden layer activation to initialize
the decoder LSTM. The decoder LSTM, to the right of the dotted line,
reads "<os> K AE T" as the past phoneme prediction sequence and uses
"K AE T </os>" as the output sequence to generate. Notice that the
input sequence for the encoder LSTM is time reversed, as in [?]. <s>
denotes the letter-side sentence beginning; <os> and </os> are the
output-side sentence begin and end symbols.

In this method, there are two sets of LSTMs: one is an encoder that
reads the source-side input sequence and the other is a decoder that
functions as a language model and generates the output. The encoder is
used to represent the entire input sequence in the last-time hidden
layer activities. These activities are used as the initial activities
of the decoder network. The decoder is a language model that uses the
past phoneme sequence $\phi_1^{t-1}$ to predict the next phoneme
$\phi_t$, with its hidden state initialized as described. It stops
predicting after outputting </os>, the output-side end-of-sentence
symbol. Note that in our models, we use <s> and </s> as input-side
begin-of-sentence and end-of-sentence tokens, and <os> and </os> for
the corresponding output symbols.

To train these encoder and decoder networks, we used back-propagation
through time (BPTT) [28, 29], with the error signal originating in the
decoder network.

We use a beam search decoder to generate the phoneme sequence during
the decoding phase. The hypothesis sequence with the highest posterior
probability is selected as the decoding result.

4. Alignment Based Models

In this section, we relax the earlier constraint that the model
translates directly from the source-side letters to the target-side
phonemes without the benefit of an explicit alignment.

4.1. Uni-directional LSTM

A model of the uni-directional LSTM is in Figure 2. Given a pair of
source-side input and target-side output sequences and an alignment
$A$, the posterior probability of the output sequence given the input
sequence is

$$p(\phi_1^T \mid A, l_1^T) = \prod_{t=1}^{T} p(\phi_t \mid \phi_1^{t-1}, l_1^t) \qquad (4)$$

where the current phoneme prediction $\phi_t$ depends both on its past
prediction $\phi_{t-1}$ and the input letter sequence $l_t$. Because of
the recurrence in the LSTM, prediction of the current phoneme depends
on the phoneme predictions and letter sequence from the sentence
beginning. Decoding uses the same beam search decoder described in
Sec. 3.

Figure 2: The uni-directional LSTM reads letter sequence
"<s> C A T </s>" and past phoneme prediction "<os> <os> K AE T". It
outputs phoneme sequence "<os> K AE T </os>". Note that there are
separate output-side begin and end-of-sentence symbols, prefixed by
"o".
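
The three token streams described for the uni-directional model (letter input, past phoneme prediction, and output) can be written out explicitly. The sketch below is illustrative only: `make_streams` is a hypothetical helper, and it assumes the aligner has already produced exactly one phoneme unit per letter, as in the CAT example (compound or null phonemes would simply appear as single tokens).

```python
from typing import List, Tuple


def make_streams(letters: List[str], aligned_phonemes: List[str]
                 ) -> Tuple[List[str], List[str], List[str]]:
    """Build per-position inputs and targets for an aligned word."""
    # Assumption: the alignment yields one phoneme unit per letter.
    assert len(letters) == len(aligned_phonemes)

    letter_input = ["<s>"] + list(letters) + ["</s>"]
    target = ["<os>"] + list(aligned_phonemes) + ["</os>"]
    # The past-phoneme stream is the target shifted right by one step,
    # started with the output-side begin symbol.
    past_phonemes = ["<os>"] + target[:-1]
    return letter_input, past_phonemes, target


letter_input, past_phonemes, target = make_streams(
    ["C", "A", "T"], ["K", "AE", "T"])
print(letter_input)   # ['<s>', 'C', 'A', 'T', '</s>']
print(past_phonemes)  # ['<os>', '<os>', 'K', 'AE', 'T']
print(target)         # ['<os>', 'K', 'AE', 'T', '</os>']
```

All three streams have the same length, so at every position the network sees one letter and one past phoneme and emits one output token, which is what the factorization in Eq. (4) requires.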

4.2. Bi-directional LSTM 

The bi-directional recurrent neural network was proposed in [30]. In
this architecture, one RNN processes the input from left-to-right,
while another processes it right-to-left. The outputs of the two
sub-networks are then combined, for example being fed into a third RNN.
The idea has been used for speech recognition [?] and more recently for
language understanding [?]. Bi-directional LSTMs have been applied to
speech recognition and machine translation [?].

Figure 3: The bi-directional LSTM reads letter sequence
"<s> C A T </s>" for the forward directional LSTM, the time-reversed
sequence "</s> T A C <s>" for the backward directional LSTM, and past
phoneme prediction "<os> <os> K AE T". It outputs phoneme sequence
"<os> K AE T </os>".

In the bi-directional model, the phoneme prediction depends on the
whole source-side letter sequence as follows

$$p(\phi_1^T \mid A, l_1^T) = \prod_{t=1}^{T} p(\phi_t \mid \phi_1^{t-1}, l_1^T) \qquad (5)$$

Figure 3 illustrates this model. Focusing on the third set of inputs,
for example, letter $l_t$ = A is projected to a hidden layer, together
with the past phoneme prediction $\phi_{t-1}$ = K. The letter $l_t$ = A
is also projected to a hidden layer in the network that runs in the
backward direction. The hidden layer activation from the forward and
backward networks is then used as the input to a final network running
in the forward direction. The output of the topmost recurrent layer is
used to predict the current phoneme $\phi_t$ = AE.

We found that performance is better when feeding the past phoneme
prediction to the bottom LSTM layer, instead of other layers such as
the softmax layer. However, this architecture can be further extended,
e.g., by feeding the past phoneme predictions to both the top and
bottom layers, which we may investigate in future work.

In the figure, we draw one layer of bi-directional LSTMs. In Section 5,
we also report results for deeper networks, in which the forward and
backward layers are duplicated several times; each layer in the stack
takes the concatenated outputs of the forward-backward networks below
as its input.

Note that the backward direction LSTM is independent of 
the past phoneme predictions. Therefore, during decoding, we 
first pre-compute its activities. We then treat the output from the 
backward direction LSTM as additional input to the top-layer 
LSTM that also has input from the lower layer forward direction 
LSTM. The same beam search decoder described before can 
then be used. 
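
To make the decoding step concrete, here is a small beam search sketch. It is not the decoder used in this work: it keeps a fixed number of hypotheses (`beam_size`), whereas the experiments in this paper report a beam width expressed in likelihood, and the scoring function `toy_model` with its probability table is invented purely for illustration. In the bi-directional setting, the pre-computed backward activations would be closed over by the scoring function.

```python
import math
from typing import Callable, Dict, List, Tuple

END = "</os>"
Hypothesis = Tuple[Tuple[str, ...], float]  # (token sequence, log-probability)


def beam_search(next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
                beam_size: int = 3, max_len: int = 20) -> Hypothesis:
    """Return the highest-scoring sequence ending in END (top-N beam)."""
    beams: List[Hypothesis] = [(("<os>",), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                hyp = (prefix + (token,), score + logp)
                if token == END:
                    finished.append(hyp)
                else:
                    candidates.append(hyp)
        if not candidates:
            break
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_size]
    if not finished:          # nothing terminated within max_len
        finished = beams
    return max(finished, key=lambda h: h[1])


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-phoneme distribution that depends on the last token."""
    table = {
        "<os>": {"K": 0.6, "S": 0.4},
        "K": {"AE": 0.9, "AH": 0.1},
        "S": {"AE": 1.0},
        "AE": {"T": 0.8, END: 0.2},
        "AH": {"T": 1.0},
        "T": {END: 1.0},
    }
    return {tok: math.log(p) for tok, p in table[prefix[-1]].items()}


best, score = beam_search(toy_model)
print(best)                       # ('<os>', 'K', 'AE', 'T', '</os>')
print(round(math.exp(score), 3))  # 0.432
```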


5. Experiments 

5.1. Datasets 

Our experiments were conducted on three US English datasets*: the
CMUDict, NetTalk, and Pronlex datasets that have been evaluated in
[20, 21]. We report phoneme error rate (PER) and word error rate
(WER)†. In the phoneme error rate computation, following [20, 21], in
the case of multiple reference pronunciations, the variant with the
smallest edit distance is used. Similarly, if there are multiple
reference pronunciations for a word, a word error occurs only if the
predicted pronunciation does not match any of the references.
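
The scoring rules above can be sketched as follows. This is a minimal illustration rather than the evaluation script used in the experiments: the function names and the toy pronunciations are assumptions, and normalizing PER by the length of the closest reference is one plausible convention that the text does not spell out.

```python
from typing import Dict, List, Tuple


def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete h
                           cur[j - 1] + 1,           # insert r
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def per_and_wer(predictions: Dict[str, List[str]],
                references: Dict[str, List[List[str]]]) -> Tuple[float, float]:
    """Return (PER %, WER %) with multiple reference pronunciations per word."""
    phone_errors = phone_total = word_errors = 0
    for word, hyp in predictions.items():
        refs = references[word]
        # Use the reference variant with the smallest edit distance.
        closest = min(refs, key=lambda r: edit_distance(hyp, r))
        phone_errors += edit_distance(hyp, closest)
        phone_total += len(closest)  # assumed normalization
        # A word error occurs only if no reference matches exactly.
        word_errors += hyp not in refs
    return 100.0 * phone_errors / phone_total, 100.0 * word_errors / len(predictions)


predictions = {
    "cat": ["K", "AE", "T"],
    "read": ["R", "IY", "D"],
    "tangle": ["T", "AE", "NG", "AH", "L"],  # missing G
}
references = {
    "cat": [["K", "AE", "T"]],
    "read": [["R", "EH", "D"], ["R", "IY", "D"]],
    "tangle": [["T", "AE", "NG", "G", "AH", "L"]],
}
per, wer = per_and_wer(predictions, references)
print(round(per, 2), round(wer, 2))  # 8.33 33.33
```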

The CMUDict contains 107877 training words, 5401 validation words, and
12753 words for testing. The Pronlex data contains 83182 words for
training, 1000 words for validation, and 4800 words for testing. The
NetTalk data contains 14985 words for training and 5002 words for
testing, and does not have a validation set.

5.2. Training details 

For the CMUDict and Pronlex experiments, all meta-parameters 
were set via experimentation with the validation set. For the 
NetTalk experiments, we used the same model structures as 
with the Pronlex experiments. 

To generate the alignments used for training the alignment-based
methods of Sec. 4, we used the alignment package of [32]. We used BPTT
to train the LSTMs. We used sentence level minibatches without
truncation. To speed up training, we used data parallelism with 100
sentences per minibatch, except for the CMUDict data, where one
sentence per minibatch gave the best performance on the development
data. For the alignment-based methods, we sorted sentences according to
their lengths, and each minibatch had sentences with the same length.
For encoder-decoder LSTMs, we did not sort sentences by length as in
the alignment-based methods, and instead followed [?].
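
The length-sorted minibatching used for the alignment-based methods can be sketched as below. The helper name and the toy word list are assumptions, and the real recipe may shuffle or order batches differently; the sketch only shows that every minibatch ends up holding sequences of a single length.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def same_length_minibatches(sentences: Sequence[Sequence[str]],
                            batch_size: int) -> List[List[Sequence[str]]]:
    """Group sequences so that each minibatch contains one length only."""
    buckets: Dict[int, List[Sequence[str]]] = defaultdict(list)
    for sentence in sentences:
        buckets[len(sentence)].append(sentence)

    batches: List[List[Sequence[str]]] = []
    for length in sorted(buckets):
        group = buckets[length]
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


words = ["CAT", "DOG", "TANGLE", "A", "BIRD", "FISH", "COW"]
for batch in same_length_minibatches(words, batch_size=2):
    print(batch)
# ['A']
# ['CAT', 'DOG']
# ['COW']
# ['BIRD', 'FISH']
# ['TANGLE']
```

Keeping one length per minibatch avoids padding, which matters here because the letter and phoneme streams must stay position-aligned.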

For the encoder-decoder LSTM in Sec. 3, we used 500 dimensional
projection and hidden layers. When increasing the depth of the
encoder-decoder LSTMs, we increased the depth of both encoder and
decoder networks. For the bi-directional LSTMs, we used a 50
dimensional projection layer and 300 dimensional hidden layer. For the
uni-directional LSTM experiments on CMUDict, we used a 400 dimensional
projection layer, 400 dimensional hidden layer, and the above described
data parallelism.

For both encoder-decoder LSTMs and the alignment-based methods, we
randomly permuted the order of the training sentences in each epoch. We
found that the encoder-decoder LSTM needed to start from a small
learning rate, approximately 0.007 per sample. For bi-directional
LSTMs, we used initial learning rates of 0.1 or 0.2. For the
uni-directional LSTM, the initial learning rate was 0.05. The learning
rate was controlled by monitoring the improvement of cross-entropy
scores on validation sets. If there was no improvement of the
cross-entropy score, we halved the learning rate. The NetTalk dataset
does not have a validation set. Therefore, on NetTalk, we first ran 10
iterations with a fixed per-sample learning rate of 0.1, reduced the
learning rate by half for 2 more iterations, and finally used 0.01 for
70 iterations.
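
The two learning-rate policies described above can be summarized in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, "no improvement" is read as the validation cross-entropy not decreasing, and the toy cross-entropy values are invented for the example.

```python
from typing import List


def next_learning_rate(lr: float, prev_ce: float, cur_ce: float) -> float:
    """Halve the rate when validation cross-entropy fails to improve."""
    return lr / 2.0 if cur_ce >= prev_ce else lr


def nettalk_schedule() -> List[float]:
    """Fixed per-sample schedule used when no validation set exists."""
    return [0.1] * 10 + [0.05] * 2 + [0.01] * 70


# Validation-driven policy on invented cross-entropy values.
lr = 0.1
history = [2.0, 1.5, 1.6, 1.4]
for prev_ce, cur_ce in zip(history, history[1:]):
    lr = next_learning_rate(lr, prev_ce, cur_ce)
print(lr)                       # 0.05

schedule = nettalk_schedule()
print(len(schedule))            # 82
print(schedule[9], schedule[10], schedule[12])  # 0.1 0.05 0.01
```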

* We thank Stanley F. Chen, who kindly shared the data set partition
he used in [?].

† We observed a strong correlation of BLEU and WER scores on these
tasks. Therefore we did not report BLEU scores in this paper.




































Method                                   PER (%)   WER (%)
encoder-decoder LSTM                       7.53     29.21
encoder-decoder LSTM (2 layers)            7.63     28.61
uni-directional LSTM                       8.22     32.64
uni-directional LSTM (window size 6)       6.58     28.56
bi-directional LSTM                        5.98     25.72
bi-directional LSTM (2 layers)             5.84     25.02
bi-directional LSTM (3 layers)             5.45     23.55

Table 2: Results on the CMUDict dataset.

Data       Method                    PER (%)   WER (%)
CMUDict    past results [20]           5.88     24.53
           bi-directional LSTM         5.45     23.55
NetTalk    past results [20]           8.26     33.67
           bi-directional LSTM         7.38     30.77
Pronlex    past results [20, 21]       6.78     27.33
           bi-directional LSTM         6.51     26.69

Table 3: The PERs and WERs using bi-directional LSTM in comparison to
the previous best performances in the literature.

The models of Secs. 3 and 4 require using a beam search decoder. Based
on validation results, we report results with a beam width of 1.0 in
likelihood. We did not observe an improvement with larger beams. Unless
otherwise noted, we used a window of 3 letters in the models. We plan
to release our training recipes to the public through the Computational
Network Toolkit (CNTK) [33].

5.3. Results 

We first report results for all our models on the CMUDict dataset [?].
The first two lines of Table 2 show results for the encoder-decoder
models. While the error rates are reasonable, the best previously
reported results of 24.53% WER [20] are somewhat better. It is possible
that combining multiple systems as in [?] would achieve the same
result, but we have chosen not to engage in system combination.

The effect of using alignment based models is shown at the bottom of
Table 2. Here, the bi-directional models produce an unambiguous
improvement over the earlier models, and by training a three-layer
bi-directional LSTM, we are able to significantly exceed the previous
state-of-the-art.

We noticed that the uni-directional LSTM with the default window size
had the highest WER, perhaps because one does not observe the entire
input sequence as is the case with both the encoder-decoder and
bi-directional LSTMs. To validate this claim, we increased the window
size to 6 to include the current and five future letters as its
source-side input. Because the average number of letters per word is
7.5 on the CMUDict dataset, the uni-directional model in many cases
thus sees the entire letter sequence. With a window size of 6 and
additional information from the alignments, the uni-directional model
was able to perform better than the encoder-decoder LSTM.
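
The effect of the window size can be seen by writing out the source-side input at one position. The sketch below assumes that a window of size n consists of the current letter plus n - 1 future letters (the text states this only for n = 6) and pads past the end of the word with `</s>`; both the helper name and the padding choice are assumptions.

```python
from typing import List


def future_window(letters: List[str], t: int, size: int,
                  pad: str = "</s>") -> List[str]:
    """Current letter at position t plus (size - 1) future letters."""
    window = letters[t:t + size]
    return window + [pad] * (size - len(window))


word = list("TANGLE")
print(future_window(word, 0, 3))  # ['T', 'A', 'N']
print(future_window(word, 0, 6))  # ['T', 'A', 'N', 'G', 'L', 'E']
print(future_window(word, 4, 3))  # ['L', 'E', '</s>']
```

For a six-letter word, the window of 6 already exposes the whole letter sequence at the first position, which is the situation described above for many CMUDict entries.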

5.4. Comparison with past results 

We now present additional results for the NetTalk and Pronlex datasets,
and compare with the best previous results. The method of [20] uses
9-gram graphone models, and [21] uses an 8-gram maximum entropy model.

Changes in WER of 0.77, 1.30, and 1.27 for CMUDict, NetTalk and Pronlex
datasets respectively are significant at the 95% confidence level. For
PER, the corresponding values are 0.15, 0.29, and 0.28. On both the
CMUDict and NetTalk datasets, the bi-directional LSTM outperforms the
previous results at the 95% significance level.
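
The significance statement can be checked with simple arithmetic on the numbers reported in Table 3. The sketch below reads "significant" as "the absolute improvement exceeds the stated threshold", which is an interpretation; the dictionary layout and function name are assumptions.

```python
from typing import Dict, Tuple

# (PER %, WER %) from Table 3 and the thresholds quoted in the text.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "CMUDict": {"past": (5.88, 24.53), "bilstm": (5.45, 23.55), "threshold": (0.15, 0.77)},
    "NetTalk": {"past": (8.26, 33.67), "bilstm": (7.38, 30.77), "threshold": (0.29, 1.30)},
    "Pronlex": {"past": (6.78, 27.33), "bilstm": (6.51, 26.69), "threshold": (0.28, 1.27)},
}


def significant_gains(results: Dict[str, Dict[str, Tuple[float, float]]]
                      ) -> Dict[str, Tuple[bool, bool]]:
    """Return, per dataset, whether the PER and WER gains exceed the thresholds."""
    verdicts = {}
    for name, row in results.items():
        per_gain = row["past"][0] - row["bilstm"][0]
        wer_gain = row["past"][1] - row["bilstm"][1]
        verdicts[name] = (per_gain > row["threshold"][0],
                          wer_gain > row["threshold"][1])
    return verdicts


for name, (per_ok, wer_ok) in significant_gains(RESULTS).items():
    print(name, per_ok, wer_ok)
# CMUDict True True
# NetTalk True True
# Pronlex False False
```

The Pronlex improvements (0.27 PER, 0.64 WER) fall just below the thresholds, which matches the claim being limited to CMUDict and NetTalk.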

6. Related Work 

Grapheme-to-phoneme conversion has important applications in
text-to-speech and speech recognition. It has been well studied in the
past decades. Although many methods have been proposed in the past, the
best performance on the standard datasets so far was achieved using a
joint sequence model [20] of grapheme-phoneme joint multi-grams, or
graphones, and a maximum entropy model [21].

To the best of our knowledge, ours is the first single
neural-network-based system that outperforms the previous
state-of-the-art methods [20, 21] on these common datasets. It is
possible to improve performance by combining multiple systems and
methods [34, 35], but we have chosen not to engage in building hybrid
models.

Our work can be cast in the general sequence to sequence translation
category, which includes tasks such as machine translation and speech
recognition. Therefore, perhaps the most closely related work is [?].
However, instead of the marginal gains in their bi-directional models,
our model obtained significant gains from using bi-directional
information. Also, their work does not include experimenting with
deeper structures, which we found beneficial. We plan to conduct
machine translation tasks to compare our models and theirs.

7. Conclusion 

In this paper, we have applied both encoder-decoder neural networks and
alignment based models to the grapheme-to-phoneme task. The
encoder-decoder models have the significant advantage of not requiring
a separate alignment step. Performance with these models comes close to
the best previous alignment-based results. When we go further, and
inform a bi-directional neural network model with alignment
information, we are able to make significant advances over previous
methods.